
    Exact and efficient top-K inference for multi-target prediction by querying separable linear relational models

    Many complex multi-target prediction problems that concern large target spaces are characterised by a need for efficient prediction strategies that avoid the computation of predictions for all targets explicitly. Examples of such problems emerge in several subfields of machine learning, such as collaborative filtering, multi-label classification, dyadic prediction and biological network inference. In this article we analyse efficient and exact algorithms for computing the top-K predictions in the above problem settings, using a general class of models that we refer to as separable linear relational models. We show how to use those inference algorithms, which are modifications of well-known information retrieval methods, in a variety of machine learning settings. Furthermore, we study the possibility of scoring items incompletely, while still retaining an exact top-K retrieval. Experimental results in several application domains reveal that the so-called threshold algorithm is very scalable, often performing many orders of magnitude more efficiently than the naive approach.
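
    A minimal sketch of the classical threshold algorithm the abstract refers to (in its variant for an additive, hence separable, scoring function) is given below in Python. The data layout (per-factor sorted lists plus a random-access score table) and the toy example are hypothetical, and this is a generic illustration of the information-retrieval method rather than the authors' implementation:

```python
import heapq

def threshold_topk(sorted_lists, scores_by_item, k):
    """Fagin-style threshold algorithm for an additive (separable) score.

    sorted_lists   : one list per factor of (item, partial_score) pairs,
                     each sorted by partial_score in decreasing order.
    scores_by_item : dict item -> list of partial scores (random access).
    k              : number of top items to retrieve exactly.
    """
    seen = set()
    top = []  # min-heap of (full_score, item) holding the current top-k
    for row in zip(*sorted_lists):                 # sorted access, round-robin by depth
        threshold = sum(score for _, score in row)
        for item, _ in row:
            if item not in seen:
                seen.add(item)
                full = sum(scores_by_item[item])   # random access to the full score
                if len(top) < k:
                    heapq.heappush(top, (full, item))
                elif full > top[0][0]:
                    heapq.heapreplace(top, (full, item))
        # stop as soon as the k-th best full score reaches the threshold
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

# Hypothetical toy example with two factors and three items
lists = [
    [("a", 0.9), ("b", 0.5), ("c", 0.1)],
    [("b", 0.7), ("a", 0.4), ("c", 0.2)],
]
by_item = {"a": [0.9, 0.4], "b": [0.5, 0.7], "c": [0.1, 0.2]}
print(threshold_topk(lists, by_item, k=1))  # [(1.3, 'a')]
```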

    On the Bayes-optimality of F-measure maximizers

    The F-measure, which was originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate, while relying on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayes-optimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems.
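
    To make the target quantity concrete: for a binary label vector y and prediction h, the instance-wise F-measure is F(y, h) = 2|y ∧ h| / (|y| + |h|), defined as 1 when both vectors are empty. The Python sketch below estimates the expected F-measure of candidate predictions from Monte Carlo samples of the joint label distribution, restricting candidates to prefixes of the marginally most probable labels; that restriction is only justified under extra distributional assumptions, so this illustrates the kind of approximate maximizer the article analyzes, not its Bayes-optimal algorithm (which instead uses a quadratic number of parameters of the joint distribution):

```python
import numpy as np

def f_measure(y, h):
    """Instance-wise F-measure of prediction h for true labels y (0/1 arrays)."""
    denom = y.sum() + h.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(y, h).sum() / denom

def best_prefix_prediction(samples):
    """Among prefixes of the marginally most frequent labels, pick the one with
    the highest expected F-measure, estimated from `samples`
    (an n_samples x n_labels 0/1 matrix drawn from the joint distribution)."""
    n_labels = samples.shape[1]
    order = np.argsort(-samples.mean(axis=0))
    best_h, best_val = np.zeros(n_labels, dtype=int), -1.0
    for size in range(n_labels + 1):
        h = np.zeros(n_labels, dtype=int)
        h[order[:size]] = 1
        val = np.mean([f_measure(y, h) for y in samples])
        if val > best_val:
            best_h, best_val = h, val
    return best_h, best_val

# Hypothetical samples from a joint distribution over 4 labels
samples = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0], [0, 1, 0, 0]])
print(best_prefix_prediction(samples))
```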

    Multivariate modeling to identify patterns in clinical data: the example of chest pain

    Background: In chest pain, physicians are confronted with numerous interrelationships between symptoms and with evidence for or against classifying a patient into different diagnostic categories. The aim of our study was to find natural groups of patients on the basis of risk factors, history and clinical examination data, which should then be validated with patients' final diagnoses. Methods: We conducted a cross-sectional diagnostic study in 74 primary care practices to establish the validity of symptoms and findings for the diagnosis of coronary heart disease. A total of 1199 patients above age 35 presenting with chest pain were included in the study. General practitioners took a standardized history and performed a physical examination. They also recorded their preliminary diagnoses, investigations and management related to the patient's chest pain. We used multiple correspondence analysis (MCA) to examine associations on variable level, and multidimensional scaling (MDS), k-means and fuzzy cluster analyses to search for subgroups on patient level. We further used heatmaps to graphically illustrate the results. Results: A multiple correspondence analysis supported our data collection strategy on variable level. Six factors emerged from this analysis: "chest wall syndrome", "vital threat", "stomach and bowel pain", "angina pectoris", "chest infection syndrome", and "self-limiting chest pain". MDS, k-means and fuzzy cluster analysis on patient level were not able to find distinct groups. The resulting cluster solutions were not interpretable and had insufficient statistical quality criteria. Conclusions: Chest pain is a heterogeneous clinical category with no coherent associations between signs and symptoms on patient level.
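
    The cluster-validity check described in the Methods and Results can be illustrated with a short sketch: run k-means for several cluster counts and inspect an internal quality criterion such as the silhouette score. The data below are purely hypothetical (random binary symptom indicators), so uniformly low scores are expected; in the study, analogously poor quality criteria on the real data supported the conclusion that no distinct patient subgroups exist:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical binary symptom/finding matrix: 1199 patients, 40 items
X = rng.integers(0, 2, size=(1199, 40)).astype(float)

# Try several numbers of clusters and check whether any solution has
# acceptable internal quality; consistently low silhouette values would
# mirror the finding that no distinct patient subgroups emerge.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```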

    Rough set approach to multiple criteria classification with imprecise evaluations and assignments

    Dominance-based Rough Set Approach (DRSA) has been introduced to deal with multiple criteria classification (also called multiple criteria sorting, or ordinal classification with monotonicity constraints), where assignments of objects may be inconsistent with respect to the dominance principle. In this paper, we consider an extension of DRSA to the context of imprecise evaluations of objects on condition criteria and imprecise assignments of objects to decision classes. The imprecisions are given in the form of intervals of possible values. In order to solve the problem, we reformulate the dominance principle and introduce second-order rough approximations. The presented methodology preserves well-known properties of rough approximations, such as rough inclusion, complementarity, identity of boundaries and precisiation. Moreover, the meaning of the precisiation property is extended to the considered case. The paper also presents a way to reduce decision tables and to induce decision rules from rough approximations. Keywords: Dominance-based rough set approach; Multiple criteria decision analysis; Multiple criteria classification; Ordinal classification; Monotonicity constraints; Decision rules; Imprecise information; Interval order.
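
    One way to reformulate the dominance principle for interval-valued evaluations, consistent with the abstract but not necessarily identical to the authors' formalization, is to distinguish necessary dominance (holding for all values in the intervals) from possible dominance (holding for some values). A small Python sketch with hypothetical gain-type criteria:

```python
def necessarily_dominates(x, y):
    """x, y: lists of (low, high) intervals on gain-type criteria.
    x certainly dominates y if, on every criterion, even the worst value of x
    is at least the best value of y."""
    return all(xl >= yh for (xl, _), (_, yh) in zip(x, y))

def possibly_dominates(x, y):
    """x possibly dominates y if, on every criterion, the best value of x
    is at least the worst value of y."""
    return all(xh >= yl for (_, xh), (yl, _) in zip(x, y))

# Two objects evaluated as intervals on two criteria
a = [(3, 5), (2, 4)]
b = [(1, 2), (1, 2)]
print(necessarily_dominates(a, b), possibly_dominates(b, a))  # True False
```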

    On Missing Labels, Long-tails and Propensities in Extreme Multi-label Classification

    The propensity model introduced by Jain et al. has become a standard approach for dealing with missing and long-tail labels in extreme multi-label classification (XMLC). In this paper, we critically revisit this approach, showing that despite its theoretical soundness, its application in contemporary XMLC works is debatable. We exhaustively discuss the flaws of the propensity-based approach, and present several recipes, some of them related to solutions used in search engines and recommender systems, that we believe constitute promising alternatives to be followed in XMLC.
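
    For context, the propensity model of Jain et al. estimates the probability p_l that a relevant label l is observed as p_l = 1 / (1 + C e^{-A log(N_l + B)}), where N_l is the label's training-set frequency, A and B are dataset-specific constants, and C = (log N - 1)(B + 1)^A with N the number of training instances; propensity-scored metrics then up-weight hits on rare labels by 1/p_l. The Python sketch below follows this commonly used formulation (the default constants are typical values, to be treated as an assumption rather than a definitive reference):

```python
import numpy as np

def propensities(label_counts, n_instances, a=0.55, b=1.5):
    """Propensity p_l that a relevant label l is observed (Jain et al. model):
    p_l = 1 / (1 + C * exp(-a * log(N_l + b))), with C = (log N - 1)(b + 1)^a."""
    c = (np.log(n_instances) - 1.0) * (b + 1.0) ** a
    return 1.0 / (1.0 + c * np.exp(-a * np.log(label_counts + b)))

def psp_at_k(y_true, scores, props, k=5):
    """Propensity-scored precision@k: hits on rare (low-propensity) labels count more."""
    top = np.argsort(-scores)[:k]
    return np.mean(y_true[top] / props[top])

counts = np.array([10000, 500, 3])            # head, torso and tail labels
print(np.round(propensities(counts, n_instances=100000), 3))  # rare labels get small propensities
```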

    Efficient set-valued prediction in multi-class classification

    In cases of uncertainty, a multi-class classifier preferably returns a set of candidate classes instead of predicting a single class label with little guarantee. More precisely, the classifier should strive for an optimal balance between the correctness (the true class is among the candidates) and the precision (the candidates are not too many) of its prediction. We formalize this problem within a general decision-theoretic framework that unifies most of the existing work in this area. In this framework, uncertainty is quantified in terms of conditional class probabilities, and the quality of a predicted set is measured in terms of a utility function. We then address the problem of finding the Bayes-optimal prediction, i.e., the subset of class labels with the highest expected utility. For this problem, which is computationally challenging as there are exponentially (in the number of classes) many predictions to choose from, we propose efficient algorithms that can be applied to a broad family of utility functions. Our theoretical results are complemented by experimental studies, in which we analyze the proposed algorithms in terms of predictive accuracy and runtime efficiency.
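
    For one family of utility functions covered by such frameworks, u(Y, y) = g(|Y|) · 1[y ∈ Y] with a nonincreasing size discount g, the best set of each size k consists of the k most probable classes, so the Bayes-optimal set can be found with a single sort followed by one pass over set sizes. The Python sketch below illustrates this; the particular discount g(k) = k^{-1/2} is a hypothetical choice, not one prescribed by the article:

```python
import numpy as np

def bayes_optimal_set(probs, g=lambda k: k ** -0.5):
    """Return the set of classes maximizing expected utility
    E[u] = g(|Y|) * sum_{c in Y} p(c), for a nonincreasing size discount g.
    For a fixed size k the best set is the k most probable classes, so a sort
    plus one scan over sizes suffices."""
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    utilities = [g(k) * cumulative[k - 1] for k in range(1, len(probs) + 1)]
    best_k = int(np.argmax(utilities)) + 1
    return set(order[:best_k].tolist()), utilities[best_k - 1]

# With this milder size penalty, the optimal prediction here is a two-class set
print(bayes_optimal_set(np.array([0.5, 0.3, 0.1, 0.1])))
```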

    Extreme Classification (Dagstuhl Seminar 18291)

    Extreme classification is a rapidly growing research area within machine learning focusing on multi-class and multi-label problems involving an extremely large number of labels (even more than a million). Many applications of extreme classification have been found in diverse areas ranging from language modeling to document tagging in NLP, face recognition to learning universal feature representations in computer vision, gene function prediction in bioinformatics, etc. Extreme classification has also opened up a new paradigm for key industrial applications such as ranking and recommendation by reformulating them as multi-label learning tasks where each item to be ranked or recommended is treated as a separate label. Such reformulations have led to significant gains over traditional collaborative filtering and content-based recommendation techniques. Consequently, extreme classifiers have been deployed in many real-world applications in industry. Extreme classification has raised many new research challenges beyond the pale of traditional machine learning, including developing log-time and log-space algorithms, deriving theoretical bounds that scale logarithmically with the number of labels, learning from biased training data, developing performance metrics, etc. The seminar aimed at bringing together experts in machine learning, NLP, computer vision, web search and recommendation from academia and industry to make progress on these problems. We believe that this seminar has encouraged inter-disciplinary collaborations in the area of extreme classification, started discussions on identifying thrust areas and important research problems, and motivated improvements over the state of the art as well as work on the theoretical foundations of extreme classification.